distance between z and $w_i$ (or between a and $w_i$) is approximated to be the distance between the midpoints
of the intervals in which they lie. Also, the density function $f_X(a)$ is approximated to be the average of the
density function in the interval in which the attribute "a" lies.
With this in mind,

$$\Pr{}'(X \in I_p) = \frac{1}{n} \sum_{s=1}^{m} N(I_s) \times \frac{f_Y\big(m(I_s) - m(I_p)\big)\,\Pr(X \in I_p)}{\sum_{t=1}^{m} f_Y\big(m(I_s) - m(I_t)\big)\,\Pr(X \in I_t)},$$

where I(x) is the interval in which "x" lies, $m(I_p)$ is the midpoint of the interval $I_p$, and $f(I_p)$ is the
average value of the density function over the interval $I_p$, p = 1, ..., m.
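
One pass of the interval-based update above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `update_interval_probs` and its parameter names are hypothetical, `counts[s]` is taken to be $N(I_s)$ (the number of perturbed values falling in interval $I_s$), and `noise_density` stands in for $f_Y$.

```python
def update_interval_probs(counts, midpoints, probs, noise_density):
    """Return the updated estimates Pr'(X in I_p) for p = 0..m-1.

    counts[s]     -- assumed to be N(I_s), perturbed values observed in interval s
    midpoints[s]  -- m(I_s), the midpoint of interval s
    probs[p]      -- current estimate Pr(X in I_p)
    noise_density -- callable standing in for the noise density f_Y
    """
    m = len(counts)
    n = sum(counts)
    new_probs = [0.0] * m
    for s in range(m):
        if counts[s] == 0:
            continue  # an empty interval contributes nothing to the sum
        # weights[p] = f_Y(m(I_s) - m(I_p)) * Pr(X in I_p)
        weights = [noise_density(midpoints[s] - midpoints[p]) * probs[p]
                   for p in range(m)]
        # The denominator does not depend on p, so it is computed once per s.
        denom = sum(weights)
        if denom == 0.0:
            continue  # illustrative choice: skip intervals the estimate cannot explain
        for p in range(m):
            new_probs[p] += counts[s] * weights[p] / denom
    return [value / n for value in new_probs]
```

When every denominator is nonzero, the returned values sum to 1, since each observed interval distributes its full count across the $m$ intervals.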



Using the preferred method of partitioning into intervals, the step at block 46 can be undertaken in
$O(m^2)$ time. It is noted that a naive implementation of the last of the above equations will lead to a
processing time of $O(m^3)$; however, because the denominator is independent of $I_p$, the results of that
computation are reused to achieve $O(m^2)$ time. In the presently preferred embodiment, the number "m" of
intervals is selected such that there are an average of 100 data points in each interval, with "m" being bound
by 10 ≤ m ≤ 100.
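
The interval-count rule just described can be written as a short helper. This is a sketch only: the function name `choose_interval_count` and its parameters are hypothetical, and integer division is an assumed rounding choice that the text does not specify.

```python
def choose_interval_count(n_points, target_per_interval=100, m_min=10, m_max=100):
    """Pick m so intervals hold about target_per_interval points, clamped to [m_min, m_max]."""
    m = n_points // target_per_interval  # assumed rounding: floor
    return max(m_min, min(m_max, m))
```

For example, 5,000 data points give 50 intervals, while very small or very large data sets fall back to the bounds of 10 and 100.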

It is next determined at decision diamond 48 whether the stopping criterion for the iterative process
disclosed above has been met. In one preferred embodiment, the iteration is stopped when the reconstructed
distribution is statistically the same as the original distribution, as indicated by a χ² goodness of fit test.
However, since the true original distribution is not known, the observed randomized distribution (of the
perturbed data) is compared with the result of the current estimation for the reconstructed distribution, and
when the two are statistically the same, the stopping criterion has been met, on the intuition that if these two
are close, the current estimation for the reconstructed distribution is also close to the original distribution.

When the test at decision diamond 48 is negative, the integration cycle counter "j" is incremented at
block 50, and the process loops back to block 46. Otherwise, the process ends at block 52 by returning the
reconstructed distribution.
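
The loop through blocks 46 to 52 might be sketched as below. Several things here are assumptions made for illustration: the names `chi_square_statistic` and `reconstruct`, the `update_step` callable (standing in for the update at block 46), the caller-supplied `critical_value` for the χ² test, and the `max_cycles` safety cap, which the text does not mention.

```python
def chi_square_statistic(observed_counts, expected_probs):
    """Pearson chi-square statistic of observed interval counts against expected probabilities."""
    n = sum(observed_counts)
    stat = 0.0
    for observed, prob in zip(observed_counts, expected_probs):
        expected = n * prob
        if expected == 0.0:
            if observed > 0:
                return float("inf")  # observed mass where none is expected
            continue
        stat += (observed - expected) ** 2 / expected
    return stat


def reconstruct(observed_counts, initial_probs, update_step, critical_value,
                max_cycles=1000):
    """Iterate until the chi-square statistic drops to critical_value or below."""
    probs = list(initial_probs)
    j = 0  # cycle counter "j"
    while j < max_cycles:  # max_cycles is an assumed safety cap
        probs = update_step(probs)  # block 46: refine the estimate
        if chi_square_statistic(observed_counts, probs) <= critical_value:
            break  # decision diamond 48: stopping criterion met
        j += 1  # block 50: increment the counter and loop back
    return probs, j  # block 52: return the reconstructed distribution
```

The critical value would normally come from a χ² table for the chosen significance level and degrees of freedom; it is left as a parameter here so the sketch needs no external library.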

Now referring to Figure 5, the logic for constructing a decision tree classifier using the reconstructed
distribution is seen. Commencing at block 54, for each attribute in the set "S" of data points, a DO loop is
entered. Moving to block 56, split points for partitioning the data set "S" pursuant to growing the data tree
are evaluated. Preferably, the split points tested are those between intervals, with each candidate split point
being tested using the so-called "gini" index set forth in Classification and Regression Trees, Breiman et al.,
Wadsworth, Belmont, 1984. To summarize, for a data set S containing "n" classes (which can be predefined
by the user, if desired) the "gini" index is given by $1 - \sum_j p_j^2$, where $p_j$ is the relative frequency of class "j" in
the data set "S". For a split dividing "S" into subsets $S_1$ and $S_2$, the index of the split is given by:



$$\text{index} = \frac{n_1}{n}\,\mathrm{gini}(S_1) + \frac{n_2}{n}\,\mathrm{gini}(S_2),$$ where $n_1$ = number of classes in $S_1$ and $n_2$ =
number of classes in $S_2$.
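
The gini computation can be illustrated with a short Python sketch. The function names `gini` and `gini_split` are hypothetical. Note one assumption: the weights $n_1/n$ and $n_2/n$ are taken here to be the fractions of data points falling in each subset, which is the usual weighted form of the index in Breiman et al.; that reading differs from the wording above and is adopted only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left_labels, right_labels):
    """Index of a split into S1 and S2, weighting each subset by its share of data points (assumed)."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    if n == 0:
        return 0.0
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)
```

A split that separates the classes perfectly scores 0, so among candidate split points the one with the lowest index would be preferred.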



The data points are associated with the intervals by sorting the values, and assigning the $N(I_1)$ lowest
values to the first interval, the next highest values to the next interval, and so on.
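
The sort-and-assign step can be sketched as follows. The function name `assign_to_intervals` is hypothetical, and `interval_counts[k]` is assumed to hold the number of points allotted to the k-th interval, with the counts summing to the number of values.

```python
def assign_to_intervals(values, interval_counts):
    """Map each value to an interval index by rank: lowest values go to the first interval."""
    if sum(interval_counts) != len(values):
        raise ValueError("interval counts must sum to the number of values")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    assignment = [None] * len(values)
    position = 0
    for interval, count in enumerate(interval_counts):
        for i in order[position:position + count]:
            assignment[i] = interval
        position += count
    return assignment
```

With values `[5.0, 1.0, 3.0, 2.0, 4.0]` and counts `[2, 1, 2]`, the two lowest values (1.0 and 2.0) land in interval 0, the value 3.0 in interval 1, and the two highest in interval 2.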